Simple Linear Regression Models

STA 101 - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • Homework 1 has been returned
  • Homework 2 has been published
    • Feel free to re-use the template from HW 1

Recap of last lecture

  • Histogram, scatterplot, boxplot
  • Average, median, variance, sd and IQR; robustness
  • Frequency, contingency and proportion tables
  • Barplot, mosaic plot
  • Effective communication: well-edited figures, \(\ge3\) variables (symbols, colors, facets)
  • R for Data Science - chapters 3 and 7

Outline

  • simple linear regression model
  • least-square estimates
  • model comparison
  • outliers

Simple linear regression

Regression model

d <- ggplot2::mpg
head(d, n = 4)
# A tibble: 4 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~

Suppose the variable hwy (fuel efficiency on highway) is very expensive to measure.

We decide to estimate it using the other variables. To do so, we will fit a regression model. \[ \text{hwy} \approx \text{model}(\text{other variables}) \]

Simple regression

We expect the variable cty to be a good proxy for hwy.

After all, if a car is efficient in the city, we expect it to also be efficient on the highway! We will therefore consider a simple regression model

\[ \text{hwy} \approx \text{model}(\text{cty}) \]

Simple linear regression

ggplot(d) +
  geom_point(aes(x = cty, y = hwy))

The variables cty and hwy are linearly associated.

We therefore opt for a simple linear regression model

\[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} \]

Parameters

The numbers \(\beta_0\) and \(\beta_1\) are the two parameters of the model.

We now want to find good values for these unknown parameters.

Let us look at two sets of values:

  • \(\beta_0 = 1, \beta_1 = 1.3 \qquad \Rightarrow \qquad \text{hwy} \approx 1 + 1.3 \text{cty}\)

  • \(\beta_0 = 15, \beta_1 = 0.5 \qquad \Rightarrow \qquad \text{hwy} \approx 15 + 0.5 \text{cty}\)

ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = 1, slope = 1.3)

ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = 15, slope = 0.5)

Prediction

We can use our models to estimate hwy for new vehicles.

Imagine there is a new vehicle with \(\text{cty} = 30\). Instead of measuring its hwy (expensive), we use our model to estimate it. Using the “good” model gives the following estimate

\[ \begin{align*} \text{hwy} & \approx \beta_0 + \beta_1 \text{cty} \\ & = 1 + 1.3 * 30\\ & = 40 \end{align*} \]

Is this the true hwy of the new vehicle? No! This is only an estimate based on the value of the variable cty and our “good” model.
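This back-of-the-envelope prediction is easy to reproduce in R (a quick sketch, using the “good” parameter values from above):

```r
# prediction from the "good" model (beta_0 = 1, beta_1 = 1.3) for cty = 30
beta_0 <- 1
beta_1 <- 1.3
cty_new <- 30
beta_0 + beta_1 * cty_new  # 40
```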

Can we do better? Yes!

  • Take additional variables into account in the model (e.g. engine size, vehicle age, etc.)
  • Use better values for \(\beta_0\) and \(\beta_1\).

Group exercise - parameters

  1. What is the prediction for the new vehicle if we use the bad model?

  2. Copy and paste the following piece of code and try different values for the parameters to find a good set of values.

library(tidyverse)
d <- ggplot2::mpg
beta_0 <- 15
beta_1 <- 0.5
ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = beta_0, slope = beta_1)
05:00

Linear association

A simple linear regression model is only applicable if the relation between the predictor and the response is linear.

If the relation is not linear, the simple linear regression is not suitable.

In this case, we need to model the non-linearity (next lecture).

Group exercise - linear association

Exercise 7.3

03:00

Least-square estimates

Residuals

Our predictions are only approximate.

  • Let us represent our prediction with \(\widehat{\text{hwy}}\) and the true value with \(\text{hwy}\)
  • the error we make is \(\text{hwy} - \widehat{\text{hwy}}\)
  • this is called the residual

\[e = \text{hwy} - \widehat{\text{hwy}}\]
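For example, under the model with \((\beta_0, \beta_1) = (1, 1.3)\), the residual for the first car in the sample (cty \(=18\), hwy \(=29\)) is:

```r
library(ggplot2)
d <- ggplot2::mpg
hwy_hat <- 1 + 1.3 * d$cty[1]  # prediction for the first car (cty = 18)
e <- d$hwy[1] - hwy_hat        # residual: observed minus predicted
e  # 29 - 24.4 = 4.6
```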

Visualizing residuals

  • Black circles: Observed values (\(\text{hwy}\))
  • Pink solid line: Least-squares regression line
  • Maroon triangles: Predicted values (\(\widehat{\text{hwy}}\))
  • Gray lines: Residuals

Residual plot

d %>%
  mutate(
    hwy_hat = 1 + 1.3 * cty,
    resid   = hwy - hwy_hat
    ) %>%
  ggplot() +
  geom_point(aes(cty, resid)) +
  geom_abline(intercept = 0, slope = 0, col = "red")

Group exercise - residuals

Exercises 7.1, 7.17, 7.19

05:00

Good estimates

We want to choose estimates that give a model that fits the data well.

  • a model with a regression line that is close to the data

We want to minimize the residuals.

Minimizing residuals

Perhaps the most natural thing to do is to find the values for \(\beta_0\) and \(\beta_1\) that minimize the sum of absolute residuals \[ |e_1|+|e_2|+\dots+|e_n| \]

For practical reasons, the sum of squared residuals (SSR) is a more common criterion \[ e_1^2+e_2^2+\dots+e_n^2 \]

Why squaring the residuals?

  • the minimization can be done by hand (pre-computer era)
  • reflects the assumption that being off by \(4\) is more than twice as bad as being off by \(2\)
  • nice mathematical properties
  • mainstream
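Computing the SSR for a given set of parameter values is a one-liner. A quick sketch using the “bad” values \((\beta_0, \beta_1) = (15, 0.5)\) from earlier:

```r
library(ggplot2)
d <- ggplot2::mpg
resid_bad <- d$hwy - (15 + 0.5 * d$cty)  # residuals under beta_0 = 15, beta_1 = 0.5
sum(resid_bad^2)                         # sum of squared residuals (SSR)
```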

Least-square estimates

We find the values for \(\beta_0\) and \(\beta_1\) that minimize the SSR with the R command lm

lm(hwy ~ cty, data = d)

Call:
lm(formula = hwy ~ cty, data = d)

Coefficients:
(Intercept)          cty  
      0.892        1.337  

We denote the least-square estimates with the symbols \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

In our model, we therefore have \[\hat{\beta}_0 = 0.892, \qquad \hat{\beta}_1 = 1.337\]
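In the simple linear regression case, these estimates also have a closed form, \(\hat{\beta}_1 = \text{cov}(x, y) / \text{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\); a quick check that this matches lm:

```r
library(ggplot2)
d <- ggplot2::mpg
b1 <- cov(d$cty, d$hwy) / var(d$cty)  # slope estimate
b0 <- mean(d$hwy) - b1 * mean(d$cty)  # intercept estimate
c(b0, b1)  # 0.892 and 1.337, as reported by lm
```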

Visualizing the least square regression line

ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = 0.892, slope = 1.337)

Reading R output

R can provide the results of a model in different formats.

m <- lm(hwy ~ cty, data = d)
m

Call:
lm(formula = hwy ~ cty, data = d)

Coefficients:
(Intercept)          cty  
      0.892        1.337  

Printing the model itself provides the least-square estimates. This is sufficient for now.

summary(m)

Call:
lm(formula = hwy ~ cty, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3408 -1.2790  0.0214  1.0338  4.0461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.89204    0.46895   1.902   0.0584 .  
cty          1.33746    0.02697  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

This format contains information that will be useful when we do inference. It is, however, difficult to read.

summary(m) %>%
  tidy() %>%
  kable(digits = 2)
term estimate std.error statistic p.value
(Intercept) 0.89 0.47 1.90 0.06
cty 1.34 0.03 49.58 0.00

This format contains the information necessary for doing inference and is easy to read. This is the format used in the textbook.

Exploring the model

We can access the residuals and predictions (fitted values) with the command augment.

m <- lm(hwy ~ cty, data = d)
augment(m)[,1:4]
# A tibble: 234 x 4
     hwy   cty .fitted .resid
   <int> <int>   <dbl>  <dbl>
 1    29    18    25.0 4.03  
 2    29    21    29.0 0.0214
 3    31    20    27.6 3.36  
 4    30    21    29.0 1.02  
 5    26    16    22.3 3.71  
 6    26    18    25.0 1.03  
 7    27    18    25.0 2.03  
 8    26    18    25.0 1.03  
 9    25    16    22.3 2.71  
10    28    20    27.6 0.359 
# ... with 224 more rows

Group exercise - exploring a model

For the third observation of the sample,

  1. what is the value of cty?
  2. what is the predicted value for hwy based on the model?
  3. what is the actual value for hwy?
  4. what is the residual?
  5. did the model over- or under-predict?
04:00

Interpreting the parameters

  • the intercept estimate \(\hat{\beta}_0\) is the prediction for a car with \(\text{cty} = 0\)
    • meaningless in our case
  • the slope estimate \(\hat{\beta}_1\) is the increase in our prediction for hwy for each additional unit of cty
    • “for each additional unit of cty, we expect hwy to increase by 1.337”

Extrapolation

  • stick to the range of the data
  • the intercept will not always be meaningful

Special case: categorical predictor

Let us create a binary predictor indicating if the car is from \(1999\).

d_binary <- mutate(d, year_binary = if_else(year == 1999, 1, 0))
head(select(d_binary, hwy, year, year_binary))
# A tibble: 6 x 3
    hwy  year year_binary
  <int> <int>       <dbl>
1    29  1999           1
2    29  1999           1
3    31  2008           0
4    30  2008           0
5    26  1999           1
6    26  1999           1

The variable year_binary takes the value \(1\) if the car is from \(1999\) and \(0\) otherwise.

lm(hwy ~ year_binary, data = d_binary)

Call:
lm(formula = hwy ~ year_binary, data = d_binary)

Coefficients:
(Intercept)  year_binary  
   23.45299     -0.02564  

The model equation is

\[ \text{hwy} \approx 23.453 - 0.026 \text{year_binary} \]

For a new car from \(1999\), the variable year_binary takes the value \(1\) and our prediction is

\[ \widehat{\text{hwy}} = 23.453 - 0.026*1 = 23.453 - 0.026 = 23.427 \]

While for a car that is not from \(1999\), the variable year_binary takes the value \(0\) and our prediction is

\[ \widehat{\text{hwy}} = 23.453 - 0.026*0 = 23.453 - 0 = 23.453 \]
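With a single binary predictor, these two predictions are simply the group means: the intercept is the mean hwy among the cars not from \(1999\), and intercept \(+\) slope is the mean among the \(1999\) cars. A quick check:

```r
library(ggplot2)
d <- ggplot2::mpg
mean(d$hwy[d$year != 1999])  # about 23.453: the intercept
mean(d$hwy[d$year == 1999])  # about 23.427: intercept + slope
```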

Model comparison

Alternative model

Let us fit an alternative model using engine size (displ) as a predictor \[ \text{hwy} \approx \beta_0 + \beta_1 \text{displ} \]

ggplot(d) +
  geom_point(aes(displ, hwy))

The least-square estimates for the coefficients are

lm(hwy ~ displ, data = d)

Call:
lm(formula = hwy ~ displ, data = d)

Coefficients:
(Intercept)        displ  
     35.698       -3.531  

Note that the slope coefficient is negative, which makes sense: cars with larger engines tend to be less efficient.

We now have two models. Which one is better?

  • We could start by looking at the residuals

Comparing residuals

m <- lm(hwy ~ cty, data = d)
m_augment <- augment(m)
ggplot(m_augment) +
  geom_histogram(aes(.resid))

lm(hwy ~ displ, data = d) %>%
  augment %>%
  ggplot() +
  geom_histogram(aes(.resid))

The first model seems to have smaller residuals.

\(\Rightarrow\) choose the first model (cty)!

But looking at a plot can be misleading

  • illusions

  • difficult to compare models with similar residuals

We need a more systematic approach for comparing models.

SSR

Instead of comparing histograms of residuals, we can compute the SSR (sum of squared residuals!)

\[ SSR = e_1^2+e_2^2+\dots+e^2_n \]

  • small residuals will give a small SSR

  • large residuals will give a large SSR

\(\Rightarrow\) choose the model with the smaller SSR!

📋 The textbook uses the term SSE (sum of squared errors).
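The comparison can be sketched directly in R with the residuals of each fitted model:

```r
library(ggplot2)
d <- ggplot2::mpg
m_cty   <- lm(hwy ~ cty,   data = d)
m_displ <- lm(hwy ~ displ, data = d)
sum(resid(m_cty)^2)    # SSR of the cty model (smaller)
sum(resid(m_displ)^2)  # SSR of the displ model (larger)
```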

\(R^2\)

While the SSR can be used to select a model, it is also useful in describing the goodness of fit of the model.

The SST (total sum of squares) is the sum of squared distances to the mean. \[ SST = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 \] It measures the total amount of variability in the data.

Remember the formula for SSR \[ SSR = (x_1 - \hat{x}_1)^2 + (x_2 - \hat{x}_2)^2 + \dots + (x_n - \hat{x}_n)^2 \] It measures the amount of variability in the data left unexplained by the model.

\(SST - SSR\) is therefore the amount of variation explained by the model: \[ \text{data} = SST = (SST-SSR) + SSR = \text{model} + \text{residuals} \]

The statistic \(R^2\) measures the proportion of variation in the data that is explained by the model.

\[ R^2 = 1-\dfrac{SSR}{SST} = \dfrac{SST-SSR}{SST} = \dfrac{\text{var. explained by model}}{\text{total var.}} \]

Note that \(0\le R^2 \le 1\).

  • good model \(\Rightarrow\) small residuals \(\Rightarrow\) small SSR \(\Rightarrow\) large \(R^2\).
  • great model \(\Rightarrow\) tiny residuals \(\Rightarrow\) tiny SSR \(\Rightarrow\) \(R^2\) close to 1.
  • poor model \(\Rightarrow\) large residuals \(\Rightarrow\) SSR almost as large as SST \(\Rightarrow\) \(R^2\) close to 0.

Computing \(R^2\) in R

To compute \(R^2\) in R, simply use the command glance.

m_glance <- glance(m)
m_glance$r.squared
[1] 0.9137752
lm(hwy ~ displ, data = d) %>% glance %>% .[["r.squared"]]
[1] 0.5867867

The model with cty as a predictor has an \(R^2\) value of \(0.914\), while the model that uses displ has an \(R^2\) of only \(0.587\).

The first model is better!

Group exercise - Interpretation

  • Exercise 7.23 (a-d)
  • Compute \(R^2\) “by hand” in R (do not use glance) for the model with parameters \((\beta_0, \beta_1) = (1, 1.3)\). You can use the following code.
library(tidyverse)
d <- ggplot2::mpg
cty <- d$cty
hwy <- d$hwy
beta_0 <- 1
beta_1 <- 1.3
hwy_hat <- . . .
03:00

Outliers

Remember, in a boxplot, outliers are observations far from the bulk of the data

ggplot(d) +
  geom_boxplot(aes(hwy))

In the context of a regression model, an outlier is an observation that falls far from the cloud of points.

Group exercise - outlier

Look at the scatterplot for the mpg data set. Are there outliers?

ggplot(d) +
  geom_point(aes(cty, hwy))

Exercise 7.25

03:00

Outliers, leverage and influential points

  • outliers: observations that fall far from the cloud of points

  • high leverage points: observations that fall horizontally away from the cloud of points

  • influential points: observations that influence the slope of the regression line

  • All influential points are high leverage points.

  • All high leverage points are outliers.

  • (Venn Diagram)

Dealing with outliers

In regression, outliers can heavily influence the least-square estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Warning

Least-square estimates are not robust to the presence of outliers.

Estimates that are robust include

  • least absolute deviation estimates (minimize sum of absolute residuals instead of SSR)
  • Bayesian estimates
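As an illustration of the first idea (not the standard implementation; the quantreg package provides one), least absolute deviation estimates can be obtained numerically with base R's optim:

```r
library(ggplot2)
d <- ggplot2::mpg
# sum of absolute residuals for a candidate (beta_0, beta_1)
sad <- function(par) sum(abs(d$hwy - (par[1] + par[2] * d$cty)))
fit <- optim(c(0, 1), sad)  # numerical minimization, starting from (0, 1)
fit$par                     # least absolute deviation estimates
```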

Impact of outliers

Let us contaminate the data with an outlier (cty \(=40\) and hwy \(=500\))

d_contaminated <- select(d, cty, hwy) %>%
  add_row(cty = 40, hwy = 500) # add outlier
arrange(d_contaminated, desc(hwy)) %>% slice(1:5)
# A tibble: 5 x 2
    cty   hwy
  <dbl> <dbl>
1    40   500
2    33    44
3    35    44
4    29    41
5    28    37

…and compare the two regression models.

lm(hwy ~ cty, data = d)

Call:
lm(formula = hwy ~ cty, data = d)

Coefficients:
(Intercept)          cty  
      0.892        1.337  
lm(hwy ~ cty, data = d_contaminated)

Call:
lm(formula = hwy ~ cty, data = d_contaminated)

Coefficients:
(Intercept)          cty  
    -33.841        3.498  

The value of the slope has almost tripled!

ggplot(d_contaminated) +
  geom_point(aes(cty, hwy)) +
  geom_abline(intercept = -33.841, slope = 3.498, col = "purple")

The regression line no longer fits the data well.

Recap

Recap

  • simple linear regression model \[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} \]
  • residuals
  • least-square estimates
  • parameter interpretation
  • model comparison with \(R^2\)
  • outliers